Complex Sampling in National Surveys
Pusat Penyelidikan Penyakit Tak Berjangkit, Institut Kesihatan Umum
Sunday, 16 November 2025
To describe a population, we usually study a sample rather than every member of that population.
Unfortunately, the sample's distribution may differ from the population's, for example by gender, ethnicity, or age.
Small studies typically restrict their sample and define the target population clearly using inclusion and exclusion criteria.
But national surveys, including health surveys, require the sample to represent the general population (e.g., adult population, older person population, maternal and child population).
# Load the packages (pacman installs any that are missing)
pacman::p_load(tidyverse, arrow)

# Population by age group and sex on 1 January 2025, read from the DOSM parquet file
pyr_df <- read_parquet("https://storage.dosm.gov.my/population/population_malaysia.parquet") %>%
  filter(date == as.Date("2025-01-01"), sex %in% c("male", "female"),
         age != "overall", ethnicity == "overall") %>%
  # Negate the male counts so that their bars extend to the left of zero,
  # and order the age groups by their lower bound
  mutate(pop_k = population, pop = if_else(sex == "male", -pop_k, pop_k),
         age0 = readr::parse_number(age), age = fct_reorder(age, age0))

# Population pyramid: mirrored horizontal bars for each age group and sex
my_pyr_plot <- ggplot(pyr_df, aes(x = age, y = pop, fill = sex)) +
  geom_col(width = 0.9) + coord_flip() +
  scale_y_continuous(limits = c(-2000, 2000), breaks = seq(-2000, 2000, 500),
                     labels = function(x) scales::comma(abs(x)),  # show absolute values on the axis
                     expand = expansion(mult = c(0.02, 0.02))) +
  labs(title = "Malaysia Population Pyramid, 2025", x = "Age group (years)",
       y = "Population (thousands)", fill = "Sex") +
  theme_minimal(base_size = 13) + theme(panel.grid.minor = element_blank())
my_pyr_plot

Structured selection – Instead of simple random sampling, respondents are chosen through stratified and clustered sampling to ensure representation across diverse groups.
Unequal probabilities – Some groups are oversampled (e.g., small states, older adults) to obtain reliable estimates, so sampling weights are needed to correct for the unequal selection probabilities.
Design-based inference – Analysis must account for the survey's design (strata, clusters, and weights) so that prevalence estimates and their standard errors reflect the target population.
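As an illustration of design-based inference, the sketch below declares strata, clusters, and weights with the `survey` package and estimates a prevalence. It assumes the `survey` package is installed; the data frame `toy`, its column names, and the weights are made up for this example and do not come from any real survey.

```r
# Minimal sketch with invented data: a stratified, clustered, weighted design.
library(survey)

set.seed(1)
toy <- data.frame(
  stratum = rep(c("urban", "rural"), each = 40),         # 2 strata
  psu     = rep(1:8, each = 10),                         # 4 clusters per stratum
  wt      = rep(c(250, 100), each = 40),                 # hypothetical sampling weights
  dm      = rbinom(80, 1, rep(c(0.15, 0.25), each = 40)) # 1 = case, 0 = non-case
)

# Tell the software how the sample was drawn
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt, data = toy)

est <- svymean(~dm, des)  # design-based prevalence with its standard error
print(est)
print(confint(est))       # 95% confidence interval

# The point estimate is the weighted mean; the design affects the standard error
by_hand <- weighted.mean(toy$dm, toy$wt)
```

The weights alone determine the point estimate, whereas the strata and clusters change the standard error and therefore the width of the confidence interval.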
Sampling: We use a sample to estimate population characteristics efficiently, saving time, cost, and resources.
Stratification: Stratifying (e.g., by gender or ethnicity) ensures that all important subgroups are represented and improves the precision of estimates.
Clustering: Clustering respondents by area makes data collection logistically practical and cost-efficient.
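The three ideas above can be sketched as a two-stage draw in base R. Everything here is hypothetical: the sampling frame, the three states used as strata, and the numbers of areas and people selected are invented for illustration only.

```r
# Minimal sketch with an invented sampling frame.
set.seed(42)

# Frame: 3 states (strata) x 20 areas (clusters) x 50 people = 3000 people
frame <- expand.grid(person = 1:50, area = 1:20, state = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
frame$area_id <- paste(frame$state, frame$area, sep = "-")

# Stage 1: within every state, select 4 of its 20 areas at random
areas <- unique(frame[, c("state", "area_id")])
picked_areas <- unlist(lapply(split(areas$area_id, areas$state),
                              sample, size = 4))

# Stage 2: within every selected area, select 10 of its 50 people at random
in_area <- frame[frame$area_id %in% picked_areas, ]
smp <- do.call(rbind, lapply(split(in_area, in_area$area_id),
                             function(d) d[sample(nrow(d), 10), ]))

# Sampling weight = 1 / selection probability = 1 / ((4/20) * (10/50))
smp$wt <- 1 / ((4 / 20) * (10 / 50))

print(table(smp$state))  # each state contributes 4 areas of 10 people
print(sum(smp$wt))       # the weights add up to the frame size
```

Fieldwork is confined to 12 areas instead of being spread over all 60, which is the practical gain from clustering, while stratifying by state guarantees that every state appears in the sample.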
| Category | Overall % | 95% CI | Male % | 95% CI | Female % | 95% CI |
|---|---|---|---|---|---|---|
| Malaysia | 15.6 | 14.4–16.9 | 15.0 | 13.6–16.5 | 16.2 | 14.7–18.0 |
| Age Group | | | | | | |
| 18–29 | 3.2 | 2.2–4.6 | 3.7 | 2.2–6.1 | 2.6 | 1.7–4.1 |
| 30–39 | 6.5 | 5.2–8.1 | 6.9 | 5.0–9.3 | 6.0 | 4.5–7.9 |
| 40–49 | 15.2 | 13.2–17.4 | 13.7 | 11.1–16.8 | 16.8 | 14.2–19.8 |
| 50–59 | 28.8 | 25.0–33.0 | 28.4 | 24.2–33.0 | 29.3 | 24.4–34.7 |
| 60+ | 38.0 | 35.4–40.7 | 37.7 | 34.0–41.5 | 38.4 | 35.0–41.8 |
| Ethnicity | | | | | | |
| Malay | 16.2 | 15.1–17.4 | 15.5 | 14.1–17.1 | 16.9 | 15.4–18.4 |
| Chinese | 15.1 | 11.6–19.5 | 14.8 | 11.2–19.3 | 15.5 | 11.0–21.3 |
| Indian | 26.4 | 22.1–31.2 | 28.4 | 22.1–35.7 | 24.5 | 19.4–30.4 |
| B. Sabah | 9.3 | 7.3–11.8 | 9.5 | 6.8–13.0 | 9.1 | 6.5–12.6 |
| B. Sarawak | 17.2 | 13.0–22.3 | 14.9 | 10.4–21.0 | 19.3 | 14.3–25.6 |
| Others | 10.2 | 7.5–13.6 | 10.0 | 6.6–14.8 | 10.6 | 6.4–17.0 |
# A tibble: 30 × 6
# Groups:   age_group, gender [10]
   age_group gender ethnicity     n dm_prev  n_dm
   <chr>     <chr>  <chr>     <int>   <dbl> <int>
 1 18-29     male   chinese      16    6.25     1
 2 18-29     male   indian       12    8.33     1
 3 18-29     male   malay        52    3.85     2
 4 18-29     female chinese      24    4.17     1
 5 18-29     female indian       18    5.56     1
 6 18-29     female malay        78    2.56     2
 7 30-39     male   chinese      16    6.25     1
 8 30-39     male   indian       12    8.33     1
 9 30-39     male   malay        52    7.69     4
10 30-39     female chinese      24    4.17     1
# ℹ 20 more rows

# A tibble: 10 × 5
# Groups:   age_group [5]
   age_group gender     n dm_prev  n_dm
   <chr>     <chr>  <int>   <dbl> <int>
 1 18-29     male      80    5        4
 2 18-29     female   120    3.33     4
 3 30-39     male      80    7.5      6
 4 30-39     female   120    6.67     8
 5 40-49     male      80   15       12
 6 40-49     female   120   18.3     22
 7 50-59     male      80   31.2     25
 8 50-59     female   120   31.7     38
 9 60+       male     120   40       48
10 60+       female   180   41.7     75

# A tibble: 5 × 4
  age_group     n dm_prev  n_dm
  <chr>     <int>   <dbl> <int>
1 18-29       200     4       8
2 30-39       200     7      14
3 40-49       200    17      34
4 50-59       200    31.5    63
5 60+         300    41     123

# A tibble: 1 × 3
      n dm_prev  n_dm
  <int>   <dbl> <int>
1  1100      22   242
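The age-group summary above can be used to show why a crude sample prevalence may mislead when some groups are oversampled. The sketch below copies the counts from that summary and reweights them to a set of population age shares. The values in `pop_share` are hypothetical and chosen only for illustration; they are not census figures.

```r
# Counts copied from the age-group summary above
age_tab <- data.frame(
  age_group = c("18-29", "30-39", "40-49", "50-59", "60+"),
  n         = c(200, 200, 200, 200, 300),
  n_dm      = c(8, 14, 34, 63, 123)
)
age_tab$prev <- age_tab$n_dm / age_tab$n

# Crude prevalence: every respondent counts equally
crude <- sum(age_tab$n_dm) / sum(age_tab$n)   # 242 / 1100

# HYPOTHETICAL population age shares (assumption, must sum to 1)
age_tab$pop_share <- c(0.30, 0.25, 0.20, 0.13, 0.12)

# Reweighted prevalence: each age group counts according to its population share
reweighted <- sum(age_tab$pop_share * age_tab$prev)

# Implied weight per respondent = population share / sample share
age_tab$wt <- age_tab$pop_share / (age_tab$n / sum(age_tab$n))

print(round(100 * c(crude = crude, reweighted = reweighted), 1))
# crude is 22%; with these assumed shares the reweighted value is about 15.4%
```

Because the 60+ group makes up 300 of the 1100 respondents but only 12% of the assumed population, its high prevalence pulls the crude figure upward. Respondents in that group therefore receive a weight below 1.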
# Simulated survey sample with one row per respondent
tibble(age_group = c("18-29","30-39","40-49","50-59","60+"),
       n_total = c(200, 200, 200, 200, 300)) %>%
  # Split every age group into 40% male and 60% female
  mutate(male = as.integer(round(.4*n_total)),
         female = n_total - male) %>%
  pivot_longer(male:female, names_to = "gender", values_to = "n_gender") %>%
  # Split every age-gender cell into 65% Malay, 20% Chinese, and the remainder Indian
  mutate(malay = as.integer(round(.65*n_gender)),
         chinese = as.integer(round(.2*n_gender)),
         indian = n_gender - malay - chinese) %>%
  pivot_longer(malay:indian, names_to = "ethnicity", values_to = "n_ethnic") %>%
  # Expand the cell counts into individual rows, then drop the count columns
  uncount(n_ethnic) %>%
  select(-starts_with("n_")) %>%
  group_by(age_group) %>%
  # Random age within the group; only 18-29 is handled, so the other groups get NA
  mutate(age = case_when(age_group == "18-29" ~ sample(18:29, n(), replace = TRUE))) %>%
  ungroup() %>%
  # Outcome indicator dm (1 = case, 0 = non-case), listed cell by cell in row order
  mutate(dm = c(rep(0, 50), rep(1, 2), rep(0, 15), rep(1, 1), rep(0, 11), rep(1, 1),
                rep(0, 76), rep(1, 2), rep(0, 23), rep(1, 1), rep(0, 17), rep(1, 1),
                rep(0, 48), rep(1, 4), rep(0, 15), rep(1, 1), rep(0, 11), rep(1, 1),
                rep(0, 73), rep(1, 5), rep(0, 23), rep(1, 1), rep(0, 16), rep(1, 2),
                rep(0, 45), rep(1, 7), rep(0, 14), rep(1, 2), rep(0, 9), rep(1, 3),
                rep(0, 65), rep(1, 13), rep(0, 20), rep(1, 4), rep(0, 13), rep(1, 5),
                rep(0, 37), rep(1, 15), rep(0, 12), rep(1, 4), rep(0, 6), rep(1, 6),
                rep(0, 55), rep(1, 23), rep(0, 18), rep(1, 6), rep(0, 9), rep(1, 9),
                rep(0, 49), rep(1, 29), rep(0, 16), rep(1, 8), rep(0, 7), rep(1, 11),
                rep(0, 72), rep(1, 45), rep(0, 23), rep(1, 13), rep(0, 10), rep(1, 17)))

Complex Sampling Design in NHMS